Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections
Abstract
The method of stable random projections is popular in data stream computations, data mining, information retrieval, and machine learning, for efficiently computing the lα (0 < α ≤ 2) distances using a small (memory) space, in one pass of the data. We propose algorithms based on (1) the geometric mean estimator, for all 0 < α ≤ 2, and (2) the harmonic mean estimator, only for small α (e.g., α < 0.344). Compared with the previous classical work [27], our main contributions include:

• The general sample complexity bound for α ≠ 1, 2. For α = 1, [27] provided an elegant argument based on the inverse of the Cauchy density about the median, leading to a sample complexity bound, although they did not provide the constants and their proof restricted ε to be "small enough." For general α ≠ 1, 2, however, the task becomes much more difficult. [27] provided the "conceptual promise" that a sample complexity bound similar to that for α = 1 should exist for general α, if a "non-uniform algorithm based on t-quantile" could be implemented. That conceptual algorithm served only to support the arguments in [27]; it was not a real implementation. We consider this to be one of the main problems left open in [27]. In this study, we propose a practical algorithm based on the geometric mean estimator and derive the sample complexity bound for all 0 < α ≤ 2.

• The practical and optimal algorithm for α = 0+. The l0 norm is an important case. Stable random projections can provide an approximation to the l0 norm using α → 0+. We provide an algorithm based on the harmonic mean estimator, which is simple and statistically optimal, and whose tail bounds are sharper than those derived from the geometric mean. We also discover a (possibly surprising) fact: on boolean data, stable random projections using α = 0+ with the harmonic mean estimator are about twice as accurate as (l2) normal random projections. Because high-dimensional boolean data are common, we expect this fact to be quite useful in practice.

• The precise theoretical analysis and practical implications. We provide the precise constants in the tail bounds for both the geometric mean and harmonic mean estimators, as well as the variances (either exact or asymptotic) of the proposed estimators. These results can help practitioners choose sample sizes accurately.
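To make the two estimators concrete, here is a minimal numerical sketch in Python (my own illustration, not code from the paper; the Chambers-Mallows-Stuck sampler, the variable names, and the sizes D, k, α are assumptions). Each projected coordinate x_j = Σ_i u_i r_ij, with r_ij i.i.d. symmetric α-stable, follows an α-stable law with scale parameter d = Σ_i |u_i|^α; d is then recovered by the geometric mean estimator (unbiased for all 0 < α ≤ 2) or by a harmonic-mean-type estimator for small α, shown here without the paper's finite-sample bias correction.

```python
import numpy as np
from scipy.special import gamma

def sample_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable variates
    with characteristic function exp(-|t|^alpha)."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))

def geometric_mean_estimator(x, alpha):
    """Unbiased for d, using
    E|x_j|^(alpha/k) = d^(1/k) (2/pi) Gamma(1-1/k) Gamma(alpha/k) sin(pi*alpha/(2k))."""
    k = len(x)
    c = (2 / np.pi) * gamma(1 - 1 / k) * gamma(alpha / k) * np.sin(np.pi * alpha / (2 * k))
    return np.exp((alpha / k) * np.log(np.abs(x)).sum()) / c ** k

def harmonic_mean_estimator(x, alpha):
    """Asymptotically unbiased for alpha < 1, using the negative moment
    E|x_j|^(-alpha) = d^(-1) / (Gamma(1+alpha) cos(pi*alpha/2));
    the paper's finite-k bias correction is omitted."""
    k = len(x)
    return k / (gamma(1 + alpha) * np.cos(np.pi * alpha / 2)
                * np.sum(np.abs(x) ** (-alpha)))

rng = np.random.default_rng(0)
D, k, alpha = 10_000, 200, 0.2               # illustrative sizes; small alpha
u = rng.exponential(size=D)                  # original data vector
x = sample_stable(alpha, (k, D), rng) @ u    # k projections, x_j ~ S(alpha, d)

d = np.sum(np.abs(u) ** alpha)               # the quantity being estimated
print(d, geometric_mean_estimator(x, alpha), harmonic_mean_estimator(x, alpha))
```

Note that for boolean data, d = Σ_i |u_i|^α equals the number of nonzeros regardless of α, which is why the α = 0+ harmonic mean estimator serves as an approximation to the l0 norm.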
Similar papers
Very Sparse Stable Random Projections, Estimators and Tail Bounds for Stable Random Projections
The method of stable random projections [39, 41] is popular for data streaming computations, data mining, and machine learning. For example, in data streaming, stable random projections offer a unified, efficient, and elegant methodology for approximating the lα norm of a single data stream, or the lα distance between a pair of streams, for any 0 < α ≤ 2. [18] and [20] applied stable random pro...
Using Stable Random Projections
Abstract Many tasks (e.g., clustering) in machine learning only require the lα distances instead of the original data. For dimension reductions in the lα norm (0 < α ≤ 2), the method of stable random projections can efficiently compute the lα distances in massive datasets (e.g., the Web or massive data streams) in one pass of the data. The estimation task for stable random projections has been ...
Efficient lα Distance Approximation for High Dimensional Data Using α-Stable Projection
In recent years, large high-dimensional data sets have become commonplace in a wide range of applications in science and commerce. Techniques for dimension reduction are of primary concern in statistical analysis. Projection methods play an important role. We investigate the use of projection algorithms that exploit properties of the α-stable distributions. We show that lα distances and quasi-d...
Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
For dimension reduction in l1, the method of Cauchy random projections multiplies the original data matrix A ∈ R^{n×D} with a random matrix R ∈ R^{D×k} (k ≪ min(n, D)) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility results, one cannot hope to recover the pairwise l1 distances in A from B = AR ∈ R^{n×k} using linear estimators without incurring large errors. Howev...
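As a concrete instance of such a nonlinear estimator, the sketch below (my own minimal example; the sizes and data are arbitrary) estimates one pairwise l1 distance from B = AR with the geometric mean, whose correction constant cos^k(π/(2k)) makes it unbiased because E|C(0, d)|^(1/k) = d^(1/k)/cos(π/(2k)).

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, k = 5, 20_000, 100            # illustrative sizes

A = rng.exponential(size=(n, D))    # original data matrix
R = rng.standard_cauchy((D, k))     # i.i.d. standard Cauchy C(0, 1) entries
B = A @ R                           # projected matrix, n x k

# Each coordinate of B[0] - B[1] is Cauchy with scale d = ||A[0] - A[1]||_1,
# so the bias-corrected geometric mean recovers d:
diff = np.abs(B[0] - B[1])
d_hat = np.cos(np.pi / (2 * k)) ** k * np.exp(np.log(diff).mean())
print(np.abs(A[0] - A[1]).sum(), d_hat)
```

An even simpler nonlinear alternative is the sample median of |B[0] - B[1]|, since the median of the absolute value of a Cauchy variable with scale d is exactly d.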
Sign Stable Projections, Sign Cauchy Projections and Chi-Square Kernels
The method of stable random projections is popular for efficiently computing the lα distances in high dimension (where 0 < α ≤ 2), using small space. Because it adopts nonadaptive linear projections, this method is naturally suitable when the data are collected in a dynamic streaming fashion (i.e., turnstile data streams). In this paper, we propose to use only the signs of the projected data an...
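A minimal sketch of the idea (my own illustration; the sizes, data, and exponential inputs are assumptions): keep only the 1-bit signs of Cauchy (α = 1) projections of two nonnegative vectors, and compare the empirical sign-agreement rate with the chi-square similarity that the paper relates it to.

```python
import numpy as np

rng = np.random.default_rng(2)
D, k = 5_000, 20_000                # data dimension, number of projections

# Two nonnegative vectors (e.g., term frequencies), normalized to unit sum.
u = rng.exponential(size=D)
u /= u.sum()
v = u + 0.3 * rng.exponential(size=D)
v /= v.sum()

R = rng.standard_cauchy((k, D))     # sign Cauchy projections: alpha = 1
agree = np.mean(np.sign(R @ u) == np.sign(R @ v))   # 1-bit sketch agreement

chi2 = np.sum(2 * u * v / (u + v))  # chi-square similarity of u and v
print(agree, chi2)
```

Because the underlying projections are nonadaptive and linear, these 1-bit sketches can be maintained under turnstile stream updates; the paper's analysis relates the sign-collision probability to the chi-square similarity computed above.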